
feat: add bmad-eval-runner skill with isolation, dependency staging, and full docs #84

Merged: bmadcode merged 6 commits into main from eval-runner on May 10, 2026

Conversation

@bmadcode (Contributor)

Summary

Adds the bmad-eval-runner skill plus complete documentation. The runner evaluates a skill's behavior in an isolated workspace (Docker preferred, local fallback) and grades the result against eval-author expectations.

Skill (5 commits)

  • Initial skill: bmad-eval-runner with claude -p based execution, isolation strategies, and discovery
  • Credential staging + correct trigger detection: macOS Keychain credential staging into the sandbox; synthetic skill placed at .claude/skills/<unique>/SKILL.md so the Skill tool can actually fire
  • Setup overlay system: base (evals/setup/) and per-eval (evals/<id>/setup/) directories are rsynced into the workspace before the skill is staged, making dependency skills available inside the sandbox
  • Trigger fix: --dangerously-skip-permissions added to claude -p invocations so the Skill tool can read SKILL.md (fixes 0% trigger rate)
  • Per-eval timeout override: evals.json entries can set "timeout": N to override the runner's default
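
A minimal sketch of the override lookup (DEFAULT_TIMEOUT and the entry shape are illustrative assumptions, not the runner's actual code):

    import json

    DEFAULT_TIMEOUT = 600  # assumed runner default, in seconds

    # Hypothetical evals.json entry; "timeout" is the new per-eval override.
    entry = json.loads('{"id": "brief-long-form", "timeout": 1200}')

    # Fall back to the runner default when no override is present.
    timeout = entry.get("timeout", DEFAULT_TIMEOUT)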

Docs

  • explanation/what-are-evals.md: artifact vs trigger evals; output vs transcript grading; best practices; worked example pointing at bmad-product-brief
  • explanation/why-bmad-eval-runner.md: isolation, dependency staging, trigger detection, permanent artifacts
  • how-to/install-docker-for-evals.md: Docker Desktop setup with credential-safety notes
  • how-to/run-evals-against-a-skill.md: 5-step run flow with worked example
  • reference/eval-format.md: complete schema (fixtures, setup overlays, per-eval timeout)
  • _diagrams/eval-test-types.excalidraw + Playwright renderer (render.mjs + render.html)
  • public/img/eval-test-types.png: rendered architecture diagram

Test Plan

  • Eval runner executed end-to-end against bmad-product-brief: all 17 artifact evals ran, 16 passed, and the single timeout was traced to a too-tight per-eval limit and fixed with the new override field
  • Trigger detection verified: synthetic skill firing observed in stream-JSON
  • Setup overlay confirmed: dependency skills (bmad-distillator, editorial review skills) available inside the sandbox
  • Docs validated: zero em dashes, zero banned vocabulary, all cross-doc links land
  • Diagram renders cleanly via Playwright (docs/_diagrams/render.mjs)

bmadcode added 5 commits May 9, 2026 13:42
New skill for running a target skill's evals in a clean, isolated environment.
Supports both artifact evals (evals.json with expectations) and trigger evals
(triggers.json with should_trigger). Adapted from Anthropic's skill-creator
eval pipeline (run_eval.py, grader.md, generate_review.py).

Isolation strategy:
  - Docker preferred: each eval runs in a fresh bmad-eval-runner:latest
    container with HOME pointed at an empty in-container dir, no host
    CLAUDE.md or auto-memory bleed-through. Image built on first run.
  - Local fallback: ~/bmad-evals/<run-id>/<eval-id>/ with HOME override
    to a clean .home/ directory. Best-effort; user is told.
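
A minimal sketch of the local fallback's HOME override (paths and the prompt are illustrative; the real run_evals.py may differ):

    import os
    import subprocess
    from pathlib import Path

    workspace = Path.home() / "bmad-evals" / "run-001" / "eval-01"
    clean_home = workspace / ".home"
    clean_home.mkdir(parents=True, exist_ok=True)

    # HOME points at an empty directory so the subprocess cannot read the
    # host's ~/.claude/, CLAUDE.md, or auto-memory.
    env = dict(os.environ, HOME=str(clean_home))
    subprocess.run(["claude", "-p", "eval prompt here"], cwd=workspace, env=env)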

Artifacts (transcript, files Claude wrote, metrics, grading) are retained
permanently per run so users can review what happened, not just whether
it passed.

Layout:
  SKILL.md                       outcome-driven entry
  references/isolation.md        Docker + local strategies
  references/eval-formats.md     evals.json + triggers.json schemas
  scripts/run_evals.py           artifact runner
  scripts/run_triggers.py        trigger runner (adapted from Anthropic)
  scripts/docker_setup.py        Docker detection + image build
  scripts/generate_report.py     aggregate HTML report
  scripts/utils.py               shared helpers
  agents/grader.md               judge subagent
  assets/Dockerfile              clean Claude Code image

Three fixes from running the runner end-to-end against bmad-product-brief:

1. Stage Claude Code OAuth credentials into each isolated workspace.
   Both isolation modes override HOME, so the subprocess can't read the
   host's ~/.claude/ and the macOS Keychain ACL prevents it from reading
   the credential directly. The parent process (which owns the ACL) now
   reads "Claude Code-credentials" via `security find-generic-password`
   once at import, then writes it as .credentials.json into each
   workspace's .claude/ before launching claude -p. ANTHROPIC_API_KEY
   passthrough still works as a fallback for non-macOS hosts (see the
   sketch after item 3).

2. Trigger detection: place the synthetic skill at .claude/skills/<name>/
   SKILL.md instead of .claude/commands/<name>.md. Slash commands do not
   surface as Skill tool calls, which is why the previous implementation
   (matching Anthropic's reference run_eval.py) reported 0% trigger rates
   for every should-trigger query. Real skills under .claude/skills/ do
   fire the Skill tool, letting the existing detector observe genuine
   trigger events.

3. Docker credential mount: write to a dedicated <eval-dir>/creds/
   directory so the container mount holds exactly one file at the
   expected path (`/creds/.credentials.json`). Mounting eval-dir directly
   would expose all run output and require the container to know an
   undocumented dot-prefix filename.
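
A sketch of the staging flow from fixes 1 and 3 (function names and paths are illustrative, not the script's actual API):

    import subprocess
    from pathlib import Path

    def read_keychain_credentials() -> str:
        # The parent process owns the Keychain ACL, so it can read the item;
        # the real script does this once at import.
        result = subprocess.run(
            ["security", "find-generic-password",
             "-s", "Claude Code-credentials", "-w"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    def stage_credentials(claude_dir: Path, creds: str) -> None:
        claude_dir.mkdir(parents=True, exist_ok=True)
        (claude_dir / ".credentials.json").write_text(creds)

    # Fix 3: a dedicated creds/ directory holds exactly one file, so the
    # Docker mount exposes only /creds/.credentials.json, not all run output.
    # stage_credentials(eval_dir / "creds", read_keychain_credentials())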

isolation.md and SKILL.md updated to document the auth flow, the
local-mode trigger leak (host-installed skills can bleed in via cwd
discovery despite the HOME override; prefer Docker for trigger evals),
and why real-skill placement is correct vs. slash-command placement.

Multi-turn workflow handling for non-headless skills is still TODO.

- Setup overlay system: rsync evals/setup/ (base) and evals/<id>/setup/
  (per-eval) onto each workspace before skill staging, making dependency
  skills and _bmad/ config available inside the sandbox (sketched after
  this list)
- Add parse_skill_dependencies, discover_setup_dirs, apply_setup_overlay
  to utils.py; wire through run_evals.py for both local and Docker modes
- Fix 0% trigger rate: add --dangerously-skip-permissions to all claude -p
  invocations in run_triggers.py (without it Skill tool cannot read SKILL.md)
- Upgrade grader.md with richer transcript parsing guidance (tool-call
  patterns, phase ordering, read-only enforcement, JSON block extraction)
- Expand eval-formats.md reference with setup overlay and dependency docs
- Default workers bumped to 8
- Add pty_runner.py (experimental; not wired into main flow)
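
A rough shape of the overlay application (function names follow this commit's description of utils.py; the bodies are illustrative):

    import subprocess
    from pathlib import Path

    def discover_setup_dirs(evals_root: Path, eval_id: str) -> list[Path]:
        # Base overlay first, then the per-eval overlay so it can override.
        return [evals_root / "setup", evals_root / eval_id / "setup"]

    def apply_setup_overlay(setup_dirs: list[Path], workspace: Path) -> None:
        for src in setup_dirs:
            if not src.is_dir():
                continue
            # Trailing slash on src: copy its contents into the workspace root.
            # check=True surfaces rsync failures here; a review comment below
            # notes the shipped code uses check=False.
            subprocess.run(["rsync", "-a", f"{src}/", f"{workspace}/"], check=True)
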
…ad-eval-runner

- explanation/what-are-evals.md: artifact vs trigger evals; output vs transcript grading
- explanation/why-bmad-eval-runner.md: isolation, dependency staging, real triggers, permanent artifacts
- how-to/install-docker-for-evals.md: Docker Desktop setup with credential-safety notes
- how-to/run-evals-against-a-skill.md: 5-step run flow with brief eval suite as worked example
- reference/eval-format.md: complete schema for evals.json + triggers.json (fixtures, setup overlays, per-eval timeout)
- _diagrams/eval-test-types.excalidraw: source diagram with Playwright renderer (render.mjs + render.html)
- public/img/eval-test-types.png: rendered architecture diagram embedded in what-are-evals.md
- update explanation/index.md and reference/index.md sidebars

coderabbitai Bot commented May 10, 2026

Warning: rate limit exceeded. @bmadcode has exceeded the limit for the number of commits that can be reviewed per hour.

📥 Commits

Reviewing files that changed from the base of the PR and between 86033fc and 12effa6.

⛔ Files ignored due to path filters (1)
  • website/public/img/eval-test-types.png is excluded by !**/*.png
📒 Files selected for processing (22)
  • docs/_diagrams/README.md
  • docs/_diagrams/eval-test-types.excalidraw
  • docs/_diagrams/render.html
  • docs/_diagrams/render.mjs
  • docs/explanation/index.md
  • docs/explanation/what-are-evals.md
  • docs/explanation/why-bmad-eval-runner.md
  • docs/how-to/install-docker-for-evals.md
  • docs/how-to/run-evals-against-a-skill.md
  • docs/reference/eval-format.md
  • docs/reference/index.md
  • skills/bmad-eval-runner/SKILL.md
  • skills/bmad-eval-runner/agents/grader.md
  • skills/bmad-eval-runner/assets/Dockerfile
  • skills/bmad-eval-runner/references/eval-formats.md
  • skills/bmad-eval-runner/references/isolation.md
  • skills/bmad-eval-runner/scripts/docker_setup.py
  • skills/bmad-eval-runner/scripts/generate_report.py
  • skills/bmad-eval-runner/scripts/pty_runner.py
  • skills/bmad-eval-runner/scripts/run_evals.py
  • skills/bmad-eval-runner/scripts/run_triggers.py
  • skills/bmad-eval-runner/scripts/utils.py

augmentcode Bot commented May 10, 2026

🤖 Augment PR Summary

Summary: Adds the bmad-eval-runner skill to run a skill’s artifact and trigger eval suites in isolated workspaces and emit a permanent, inspectable run report.

Changes:

  • Introduced the bmad-eval-runner skill definition plus a dedicated grader subagent prompt.
  • Implemented Python runners for artifact evals (run_evals.py) and trigger evals (run_triggers.py) with Docker-preferred isolation and a local fallback.
  • Added Docker image management (docker_setup.py) and a minimal runner image (assets/Dockerfile).
  • Added an aggregate HTML report generator (generate_report.py) that combines execution results, per-eval grading, and trigger rates.
  • Added/updated docs covering eval concepts, why isolation matters, Docker installation, how to run evals, and the full eval schema.
  • Added diagram sources and a Playwright-based renderer to produce committed PNGs for the documentation.

Technical Notes: Workspaces are staged via project rsync + setup overlays + fixtures; runs capture stream-JSON transcripts and support per-eval timeout overrides.


bmadcode merged commit 72628e2 into main, May 10, 2026
4 checks passed

augmentcode Bot left a comment

Review completed. 5 suggestions posted.

Comment thread: docs/_diagrams/render.mjs (outdated)

const sceneJson = JSON.parse(readFileSync(inPath, "utf-8"));

const htmlPath = resolve(fileURLToPath(import.meta.url), "..", "excalidraw_render.html");

augmentcode Bot commented May 10, 2026

htmlPath points at excalidraw_render.html, but this PR adds docs/_diagrams/render.html, so the renderer won’t be able to page.goto() the HTML it expects and will fail/hang.

Severity: medium


workspace_snapshot_before = snapshot_files(workspace_project)

home_dir = workspace_root / ".home"
stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS)

augmentcode Bot commented May 10, 2026

Staging the macOS Keychain OAuth JSON into the per-eval run directory (workspace/.home/.claude/.credentials.json and eval_dir/creds/.credentials.json) appears to persist credentials in the “artifacts are forever” run folder, which is a significant secret-leak risk if runs are backed up or shared.

Severity: high

Other Locations
  • skills/bmad-eval-runner/scripts/run_evals.py:232
  • skills/bmad-eval-runner/scripts/run_triggers.py:148
  • skills/bmad-eval-runner/scripts/run_triggers.py:225
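
One possible mitigation, sketched (hypothetical; not code from this PR): delete the staged secret once the claude -p subprocess exits, so the permanent run folder never retains it.

    from pathlib import Path

    def scrub_staged_credentials(eval_dir: Path) -> None:
        # Paths follow the locations named in the finding above; the actual
        # layout may differ per isolation mode.
        for rel in ("workspace/.home/.claude/.credentials.json",
                    "creds/.credentials.json"):
            target = eval_dir / rel
            if target.exists():
                target.unlink()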


if e.stderr:
    stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:]

new_files = diff_workspace(workspace_project, workspace_snapshot_before)

augmentcode Bot commented May 10, 2026

Local artifact capture only includes newly-created paths (after - before), so edits to existing files (e.g., Update/Validate flows) won’t be reflected in artifacts/ for grading; in Docker mode the container script rsyncs the entire workspace (including the whole project), which can massively bloat runs and dilute what the skill actually produced.

Severity: medium

Other Locations
  • skills/bmad-eval-runner/scripts/run_evals.py:259
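
A sketch of a snapshot that would also catch edits (a hypothetical replacement for snapshot_files/diff_workspace that compares size and mtime instead of path presence alone):

    from pathlib import Path

    def snapshot_files(root: Path) -> dict[str, tuple[int, float]]:
        # Record (size, mtime) per file so later edits are detectable.
        return {
            str(p.relative_to(root)): (p.stat().st_size, p.stat().st_mtime)
            for p in root.rglob("*") if p.is_file()
        }

    def diff_workspace(root: Path, before: dict) -> list[str]:
        # New files plus files whose size or mtime changed.
        after = snapshot_files(root)
        return [rel for rel, sig in after.items() if before.get(rel) != sig]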


for src in setup_dirs:
    if not src.is_dir():
        continue
    subprocess.run(

augmentcode Bot commented May 10, 2026

apply_setup_overlay() shells out to rsync unconditionally and ignores failures (check=False), so on hosts without rsync (or if rsync errors) overlays can silently not apply and dependency staging can fail in hard-to-debug ways.

Severity: medium


    pending_tool = name
    accumulated_json = ""
else:
    return False, ""

augmentcode Bot commented May 10, 2026

parse_stream_for_trigger() returns False immediately when it sees a tool_use that isn’t Skill/Read (and also returns False after the first assistant event lacking the tool), which can create false negatives if the synthetic skill fires later in the stream.

Severity: medium
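
A sketch of a whole-stream scan that avoids the early return (hypothetical; the event shape assumes the stream-JSON format the runner already parses):

    import json

    def stream_fired_skill(transcript_path: str) -> bool:
        # Scan every event instead of bailing on the first non-Skill tool_use,
        # so a synthetic skill that fires late is not a false negative.
        with open(transcript_path) as fh:
            for line in fh:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue
                content = (event.get("message") or {}).get("content") or []
                for block in content:
                    if (isinstance(block, dict)
                            and block.get("type") == "tool_use"
                            and block.get("name") == "Skill"):
                        return True
        return False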


bmadcode deleted the eval-runner branch, May 10, 2026 00:04